Motivated by the problem of defining a distance between two sequences of characters, we
investigate the so-called learning process of typical sequential data compression schemes. We focus
on the problem of how a compression algorithm optimizes its features at the interface between two
different sequences A and B while zipping the sequence A + B obtained by simply appending B
after A. We show the existence of a universal scaling function (the so-called learning function) which
rules the way in which the compression algorithm learns a sequence B after having compressed a
sequence A. In particular it turns out that there exists a crossover length for the sequence B, which
depends on the relative entropy between A and B, below which the compression algorithm does not
learn the sequence B (measuring in this way the relative entropy between A and B) and above which
it starts learning B, i.e. optimizing the compression using the specific features of B. We check the
scaling function on three main classes of systems: Bernoulli schemes, Markovian sequences and the
symbolic dynamics generated by a nontrivial chaotic system (the Lozi map). As a last application of
the method we present the results of a recognition experiment, namely recognizing which dynamical
system produced a given time sequence. We finally point out the potential of these results
for segmentation purposes, i.e. the identification of homogeneous sub-sequences in heterogeneous
sequences (with applications in various fields from genetics to time-series analysis).

I. INTRODUCTION



The modern approach to time series analysis based on the theory of dynamical systems and information theory has 
represented a major advance in the description and comprehension of a wide range of phenomena, from geophysics to 
industrial processes ||. 

Time series represent a particular example of the wider category of strings of characters which also includes as further 
examples texts or genetic sequences (DNA, proteins). When analyzing a string of characters the main question is to 
extract the information it brings. For example in a DNA sequence this would correspond to the identification of the 
subsequences codifying the genes and their specific functions. 
On the other hand for a written text one is interested in understanding it, i.e. recognizing the language in which the
text is written, its author, the subject treated and possibly the historical background. More generally one would
be interested in the extraction of specific features and trends that could help in characterizing the sequences. With a
slight misuse of the word we shall refer in the following to this kind of information as semantic information because
it refers to some underlying meaning whose quantification is far from being a trivial task [0.



In the spirit of having specific tools for measuring the amount of information carried by a sequence, it is
rather natural to approach the problem from a very interesting point of view: that of information theory ||. Born
in the context of electric communications, information theory has acquired, since the seminal paper of Shannon jij, a
leading role in many other fields such as computer science, cryptography, biology and physics. In this context the word
information acquires a very precise meaning, namely that of the entropy of the string, a measure of the surprise the
source emitting the sequences can reserve to us. Besides the notion of entropy, information theory provides us with a
series of tools for more sophisticated measures of complexity. In this perspective another important concept is related
to the definition of a suitable measure of remoteness between pairs of sequences. This kind of measure can be crucial
for the implementation of algorithms aimed at recognition purposes.

As is evident, the word information is used with different meanings in different contexts. Suppose now for a while
that we are able to measure the entropy of a given sequence. Is it possible to obtain from this measure the information (in
the semantic sense) we were trying to extract from the sequence? The answer to this question is again far from being
trivial. We shall nevertheless show in the following that at least partial answers can be drawn within an a posteriori
reasoning.

Having this plan in mind the first logical step is to provide us with some tools to be used in the measurements 
of the amount of information contained in a given string or in the relative amount of information between pairs of 
strings. In this spirit this paper is mainly focused on the so-called compression algorithms, i.e. the zippers. 

It is well known that compression algorithms provide a powerful tool for the measure of the entropy and more
generally for the estimation of more sophisticated measures of complexity ^, ||]. Since the entropy of a string fixes
the minimum number of bits one should use to reproduce it, it is intuitive that a typical zipper, besides trying to
reduce the space occupied on a memory storage device, can be considered as an entropy meter. The better the
compression algorithm, the closer the length of the zipped file will be to the minimal entropic limit and the better the
estimate of the entropy provided by the zipper.

A great improvement in the field of data compression has been represented by the so-called Lempel and Ziv 77
algorithm (LZ77) 0. As we shall see in the following, this algorithm zips a file by exploiting the existence of repeated
sub-sequences of characters in it. Its compression efficiency becomes optimal as the length of the file goes to infinity.

Starting from this idea Merhav and Ziv proposed a general method which allows one to compute very precisely the
relative entropy between two sequences of characters. A similar approach has been proposed by Milosavljevic.
Another contribution in this direction is due to Farach et al. [10], in which it is shown that the best way to estimate
the entropy of a source is to evaluate the relative entropy of two sequences emitted by the source itself.

The relative entropy can be considered as a measure of the remoteness of two sequences and it represents a very
important tool to be used for instance for classification purposes. A severe limitation arises from the fact that the
relative entropy is not a distance in a mathematical sense: it is not symmetric and it does not satisfy the triangle
inequality. In many applications, like for instance phylogenesis, it is fundamental to define a real distance between
sequences.

In this perspective a very important contribution has been given by the group of Li in p|, where a rigorous
definition of distance between unaligned sequences has been proposed using the information theoretical concepts of
Kolmogorov complexity [ p2[. The computation of this distance is implemented by means of original data compression
techniques [13].

Moreover in the field of the so-called computational linguistics there have been several contributions showing 
how data compression techniques could be useful in solving different problems (an incomplete list would include 
p^ , [Ts] , p^ , p^ , 19, |2l]): language recognition, authorship recognition or attribution, language classification, 
classification of large corpora by subject, etc. But of course the possibilities of data-compression based methods go
beyond computational linguistics. Another important example is that of genetically motivated problems. Here also there
have been important contributions: for an incomplete bibliography we refer to |2^, and references therein.
Another important field of application is represented by the theory of Dynamical Systems [2^, 25, 

It is evident how the specific features of data compression techniques make them potentially very important for 
fields where the human intuition can fail: DNA and protein sequences as already mentioned but more in general 
geological time series, stock market data, medical monitoring, etc. 

Some of us have recently proposed a method [pTf for context recognition and context classification of strings of
characters or other equivalent coded information. The remoteness between two sequences A and B was estimated
by zipping a sequence A + B obtained by appending the sequence B after the sequence A (using the gzip
compressor |2^). This idea is used for authorship attribution and, defining a suitable distance between sequences, for
language phylogenesis.

The idea of appending two files and zipping the resulting file in order to measure the remoteness between them had been
previously proposed by Loewenstern et al. |^ (using zdiff routines), who applied it to the analysis of DNA sequences,
and by Khmelev [19], who applied the method to authorship attribution. In particular, in that work the method is extensively
tested using many different zippers, including gzip. Though the idea is the same, the practical implementation differs
from the one proposed in

Along the same lines other very similar approaches have been proposed. In particular Teahan (see jist and
references therein) performs authorship attribution and text classification using Prediction by Partial Matching
(PPM) ||2^ data compression algorithms; Juola (see [16] and references therein) performs language phylogenesis using
an algorithm proposed by Wyner to measure relative entropy; Thaper [p^ performs experiments of authorship
attribution using PPM and LZ78 [29] algorithms.

In this paper we extend the analysis of [ pl| by considering in more detail the features of data compression algorithms
when applied to generic strings of characters. The specific question we raise here is how LZ77-like compression
algorithms behave at the interface between two different files. More specifically we shall focus on the process by which
a typical zipper learns the sequence it is processing and how it uses previous information acquired while zipping a
given file to zip a second different file. We point out in particular the existence of a scaling function which rules the
way in which the compression algorithm learns the sequence B after having zipped sequence A. Let us notice that
this kind of problem is closely related to the so-called segmentation problem, i.e. the identification of homogeneous
sub-sequences in heterogeneous sequences (with applications in various fields from genetics to time-series analysis).

Since in this case we are interested in exploring the features of the compression algorithms we shall use as benchmark
systems time sequences generated by dynamical systems of increasing complexity. In particular the scaling function is
checked numerically for three main classes of systems: Bernoulli schemes, Markovian sequences and the nontrivial
symbolic dynamics generated by the so-called Lozi map. As a last application of the method we present the results of
a recognition experiment, namely recognizing which dynamical system produced a given time sequence.

The outline of the paper is as follows. In section II we recall some basic definitions. In section III we recall the
definition of relative entropy and the Merhav and Ziv algorithm. In section IV we study analytically what happens
when applying the LZ77 algorithm to a sequence obtained by appending two different sequences. In section V we analyze
numerically the results of section IV. In section VI we perform a recognition experiment on sequences generated
by the Lozi map. Finally in section VII we draw the conclusions and discuss possible fields of application for these
techniques.



II. BASIC CONCEPTS 

Originally information theory (IT) was introduced by Shannon Q in the practical context of electric
communications. The powerful concepts and techniques of IT allow for a systematic study of sources emitting sequences of
discrete symbols (e.g. binary digit sequences), and in the last decades deep relations have been shown between
IT and other fields such as computer science, cryptography, biology and chaotic systems [|[ p6| .

Consider a symbolic sequence $S(1), S(2), S(3), \ldots$ where $S(t)$ is the symbol emitted at time $t$ and each $S$ can assume
one of $m$ different values. Assuming that the sequence is stationary we introduce the $N$-block entropy:

$$H_N = -\sum_{\{C_N\}} P(C_N) \ln P(C_N) \qquad (1)$$

where $P(C_N)$ is the probability of the $N$-word $C_N = [S(t), S(t+1), \ldots, S(t+N-1)]$. The difference

$$h_N = H_{N+1} - H_N \qquad (2)$$

has a rather obvious meaning: it is the average information supplied by the $(N+1)$-th symbol, provided the $N$
previous ones are known. Noting that the knowledge of a longer past history cannot increase the uncertainty on the
next outcome, one has that $h_N$ cannot increase with $N$, i.e. $h_{N+1} \le h_N$. Now we are ready to introduce the Shannon
entropy for an ergodic stationary process:

$$h = \lim_{N \to \infty} h_N = \lim_{N \to \infty} \frac{H_N}{N}. \qquad (3)$$

It is easy to see that for a $k$-th order Markov process (i.e. such that the conditional probability to have a given
symbol only depends on the last $k$ symbols, $P(S(t)|S(t-1), S(t-2), \ldots) = P(S(t)|S(t-1), S(t-2), \ldots, S(t-k))$)
one has $h_N = h$ for $N \ge k$.
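
The definitions of the block entropy and of the conditional entropy $h_N$ can be illustrated numerically. The following is only a minimal sketch: it uses a naive plug-in estimate from empirical word frequencies (which is biased for large $N$ and short sequences), and the function names `block_entropy` and `conditional_entropy` as well as the parameter values are assumptions introduced here for illustration.

```python
import math
import random
from collections import Counter


def block_entropy(seq, n):
    """Plug-in estimate of H_N (in nats) from the empirical N-word frequencies."""
    if n == 0:
        return 0.0
    words = Counter(tuple(seq[t:t + n]) for t in range(len(seq) - n + 1))
    total = sum(words.values())
    return -sum((c / total) * math.log(c / total) for c in words.values())


def conditional_entropy(seq, n):
    """h_N = H_{N+1} - H_N: average information carried by the (N+1)-th symbol."""
    return block_entropy(seq, n + 1) - block_entropy(seq, n)


if __name__ == "__main__":
    rng = random.Random(0)
    p = 0.3  # probability of symbol 0 (illustrative value, not from the paper)
    seq = [0 if rng.random() < p else 1 for _ in range(20_000)]
    exact = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    for n in range(4):
        # For a memoryless source every h_N should be close to the exact entropy.
        print(n, conditional_entropy(seq, n), exact)
```

For a sequence without memory the estimates of $h_N$ should stay close to the exact entropy for all small $N$, while for a periodic sequence they should drop to nearly zero as soon as $N$ covers the period.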

The Shannon entropy $h$ is a measure of the "surprise" the source emitting the sequence reserves to us. Let us try
to explain in which sense entropy can be considered as a measure of surprise. Suppose that the surprise one feels
upon learning that an event E has occurred depends only on the probability of E. If the event occurs with probability
1 (sure!) our surprise in its occurring will be zero. On the other hand if the probability of occurrence of the event E
is quite small our surprise will be proportionally large. For a single event occurring with probability $p$ the surprise
is proportional to $-\ln p$. Let us consider now a random variable $X$, which can take $N$ possible values $x_1, \ldots, x_N$ with
probabilities $p_1, \ldots, p_N$: the expected amount of surprise we shall receive upon learning the value of $X$ is given precisely
by the entropy of the source emitting the random variable $X$, i.e. $-\sum_i p_i \ln p_i$.

A theorem, due to Shannon and McMillan ||, ||, expresses in a precise way how $h$ quantifies the "complexity" of
the source: if $N$ is large enough, the set of $N$-words $\{C_N\}$ can be partitioned in two classes, $\Omega_1(N)$ and $\Omega_2(N)$, such
that all the words $C_N \in \Omega_1(N)$ have probability $P(C_N) \sim \exp(-hN)$ and

$$\sum_{C_N \in \Omega_1(N)} P(C_N) \to 1 \quad \text{for } N \to \infty \qquad (4a)$$

$$\sum_{C_N \in \Omega_2(N)} P(C_N) \to 0 \quad \text{for } N \to \infty. \qquad (4b)$$

An important implication of the theorem is that the number of typical sequences $M_{\mathrm{eff}}(N)$ (those in $\Omega_1(N)$) effectively
observable is

$$M_{\mathrm{eff}}(N) \sim e^{hN}. \qquad (5)$$
Note that in non-trivial cases, in which $h < \ln m$, one has $M_{\mathrm{eff}}(N) \ll m^N$, $m^N$ being the total number of possible $N$-words.
Let us remark that the Shannon-McMillan theorem for processes without memory is nothing but the law of large
numbers. Equation (5) is somehow the equivalent in IT of the Boltzmann relation in statistical thermodynamics,
$S \propto \ln W$, $W$ being the number of possible microscopic states and $S$ the thermodynamic entropy. It is remarkable
that, under rather natural assumptions, the Shannon entropy $h$, apart from a multiplicative factor, is the unique
quantity which characterizes the "surprise".

An important result is the relation between the maximum compression rate of a sequence $(S(1), S(2), S(3), \ldots)$,
expressed in an alphabet with $m$ symbols, and $h$. If the length $T$ of the sequence is large enough, then it is not
possible to compress it into another sequence (with an alphabet with $M$ symbols) whose size is smaller than $Th/\ln M$.
Therefore, noting that the information carried by a symbol in an alphabet with $M$ symbols is at most $\ln M$ (in natural
units), one has that the maximum allowed compression rate is $h/\ln M$. Perhaps the simplest way to compress, at least at a
conceptual level, is via the Shannon-Fano procedure, which is able to reach asymptotically the maximum allowed compression rate.
Also the popular Lempel-Ziv coding Q (see in the following for a short discussion) gives the same asymptotic results.

We stress the fact that $h$ is an asymptotic quantity which gives the behavior of $h_N$ (or equivalently $H_N$) at large
$N$, i.e. $h \simeq h_N$ for $N \gg 1$. On the other hand the features of $h_N$ (or $H_N$) for moderate $N$ are rather important in
all nontrivial processes (i.e. with memory). Grassberger [35] proposed a way to characterize the behavior of $H_N$. Let
us introduce

$$\delta h_N = h_{N-1} - h_N, \qquad (6)$$

and the effective measure complexity $C$:

$$C = \sum_{N=1}^{\infty} N \, \delta h_N. \qquad (7)$$

It is not difficult to realize that, for large $N$, one has

$$H_N \simeq C + hN. \qquad (8)$$

In trivial processes (e.g. Bernoulli schemes) $C = 0$; on the other hand $C$ can be nonzero in cases with zero $h$ (e.g.
periodic sequences).



III. DATA COMPRESSION AND COMPLEXITY 

As already mentioned there exists an important relation between the maximum compression rate achievable for a given
sequence and its Shannon entropy. The problem of the optimal coding for a text (or an image or any other kind of
information) has then to face the intrinsic limit to encoding a given sequence: the entropy of the sequence. We
have also mentioned that there are many equivalent definitions of entropy but probably the best definition for our
purposes in this paper is the Chaitin-Kolmogorov complexity ^ 34: the algorithmic complexity of a string
of characters is given by the length (in bits) of the smallest program which produces the string as output. A string
is said to be complex if its complexity is proportional to its length. This definition is really abstract: in particular it is
impossible, even in principle, to find such a program ]T^ . Since this definition tells nothing about the time the best
program should take to reproduce the sequence, one can never be sure that there does not exist somewhere another
shorter program that will eventually produce the string as output in a longer (possibly infinite) time.

Despite this difficulty one has to recall that there are algorithms explicitly conceived to approach the theoretical
limit of the optimal coding. These are the file compressors or zippers. A zipper takes a file and tries to transform it into
the shortest possible file. Obviously this is not the best way to encode the file but it represents a good approximation
of it. One of the first compression algorithms is the Lempel and Ziv algorithm (LZ77) ||^ (used for instance by gzip
and zip). It is interesting to briefly recall how it works. The LZ77 algorithm finds duplicated strings in the input data.
The second occurrence of a string is replaced by a pointer to the previous string given by two numbers: a distance,
representing how far back into the window the sequence starts, and a length, representing the number of characters
for which the sequence is identical. For example, in the compression of an English text, an occurrence of the sequence
"the" will be represented by the pair (d, 3), where d is the distance between the occurrence we are considering and
the previous one. It is important to mention that the zipper does not recognize the word "the" as a dictionary word but
only as a specific sequence of characters, without any reference to the words belonging to a specific dictionary. The
sequence will then be encoded with a number of bits equal to $\log_2 d + \log_2 3$, i.e. the number of bits necessary
to encode d and 3. Roughly speaking the average distance between two consecutive "the" in an English text is of the
order of 10 characters. Therefore the sequence "the" will be encoded with less than 1 byte instead of 3 bytes.
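
The pointer construction described above can be sketched in a few lines. This is a toy illustration, not the gzip implementation: the search is a brute-force scan over the whole prefix already seen (no sliding-window limit), and the helper names `longest_previous_match` and `pointer_cost_bits` are hypothetical.

```python
import math


def longest_previous_match(text, i):
    """Return (distance, length) of the longest match for text[i:] starting before i.

    Brute-force search over the whole already-seen prefix; the absence of a
    sliding-window limit is a simplifying assumption of this sketch.
    A match may run past position i (overlapping copy).
    """
    best_dist, best_len = 0, 0
    for start in range(i):
        length = 0
        while i + length < len(text) and text[start + length] == text[i + length]:
            length += 1
        if length > best_len:
            best_dist, best_len = i - start, length
    return best_dist, best_len


def pointer_cost_bits(distance, length):
    """Idealized pointer cost log2(distance) + log2(length), as in the estimate in the text."""
    return math.log2(distance) + math.log2(length)


if __name__ == "__main__":
    text = "the cat and the dog"
    i = text.index("the", 1)  # position of the second "the"
    d, n = longest_previous_match(text, i)
    print(d, n, pointer_cost_bits(d, n))
```

With a distance of order 10 and a length of 3 the idealized cost is about $\log_2 30 \approx 4.9$ bits, consistent with the statement that "the" is encoded with less than one byte.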

The LZ77 algorithm has the following remarkable property: if it encodes a text of length $L$ emitted by an ergodic source
whose entropy per character is $h$, then the length of the zipped file divided by the length of the original file tends to
$h$ when the length of the text tends to $\infty$. In other words it does not encode the file in the best way but it does it
better and better as the length of the file increases. More precisely the code rate, i.e. the number of bits per character
needed to encode the sequence, can be written as:

$$\text{code rate} \simeq \frac{\text{number of bits to encode the phrase}}{\text{length of a phrase}} \simeq \frac{\ln n + \ln L_n + O(\ln \ln L_n)}{L_n \ln 2}, \qquad (9)$$

where $L_n$ is the length of the phrase substituted and $n$ is the length of the part of the sequence already analyzed. Note
that $\ln n/\ln 2$ is the number of bits needed to encode the part of the pointer describing the distance, while $\ln L_n/\ln 2$
is the number of bits needed to encode the part of the pointer describing the length of the substitution. Recalling |3^
that for $n \to \infty$ one has that $L_n \to \ln n/h$ (in probability), one obtains

$$\text{code rate} \simeq \frac{h}{\ln 2} + O\left(\frac{\ln \ln n}{\ln n}\right), \qquad (10)$$

i.e. the LZ77 algorithm converges asymptotically to the Shannon entropy even though the convergence is extremely
slow. The presence of the term $\ln 2$ is due to the fact that in the definition of $h$ in (1) we use the natural logarithm.

It is worth recalling that the redundancy of the LZ77 coding has been rigorously determined by Savari in [ p7[ .

A variant of LZ77 was introduced in 1978 under the name LZ78 [29]. In this case the algorithm
starts reading the text. Whenever it meets a new sequence of characters it associates a new character equivalent
to this sequence. From then on, whenever it meets this sequence, it will encode it with a single character.
Essentially, after a while the algorithm will be able to encode the typical sequences of characters in a short way. For
LZ78, thus, after a while the sequence "the" will be associated with a new character. More precisely, after a while
the zipper is able to encode the word "the" not with 3 bytes but only with 3-4 bits.

The first conclusion one can draw is therefore about the possibility to measure the entropy of a large enough 
sequence simply by zipping it. For example if one compresses an English text the length of the zipped file is typically 
of the order of one fourth of the length of the initial file. An English file is encoded with 1 byte (8 bits) per character. 
This means that after the compression the file is encoded with about 2 bits per character. Obviously this is not yet 
optimal. Shannon with an ingenious experiment showed that the entropy of the English text is something between 
0.6 and 1.3 bits per character [Q. 
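
The idea of using a zipper as an entropy meter can be tried directly with Python's standard `zlib` module, which implements the same DEFLATE scheme (LZ77 plus Huffman coding) used by gzip. The sketch below is only illustrative: the helper names and the test sequences are assumptions made here, and for short inputs the result overestimates the entropy because of the slow convergence discussed above.

```python
import random
import zlib


def bits_per_character(data: bytes, level: int = 9) -> float:
    """Compressed size in bits divided by the number of input characters."""
    return 8 * len(zlib.compress(data, level)) / len(data)


def random_binary_text(n: int, p_zero: float, seed: int = 0) -> bytes:
    """ASCII string of '0' and '1' characters, with '0' drawn with probability p_zero."""
    rng = random.Random(seed)
    return "".join("0" if rng.random() < p_zero else "1" for _ in range(n)).encode()


if __name__ == "__main__":
    # A fair binary source has entropy 1 bit per character; the zipper gets close
    # to this value from above, although each character is stored in 8 bits.
    print(bits_per_character(random_binary_text(20_000, 0.5)))
    # A constant sequence has zero entropy and compresses almost to nothing.
    print(bits_per_character(b"a" * 20_000))
```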



A. Relative entropy 



Another important quantity we need to recall is the notion of relative entropy or Kullback-Leibler divergence p9| ,
|4l|], which is a measure of the statistical remoteness between two distributions. Its essence can be easily grasped
with the following example. Let us consider two ergodic sources A and B emitting sequences of 0 and 1: A emits a 0
with probability $p$ and 1 with probability $1 - p$ while B emits 0 with probability $q$ and 1 with probability $1 - q$. As
already described, the compression algorithm applied to a sequence emitted by A will be able to encode the sequence
almost optimally, i.e. coding a 0 with $-\log_2 p$ bits and a 1 with $-\log_2(1 - p)$ bits. This optimal coding will not
be the optimal one for the sequence emitted by B. In particular the entropy per character of the sequence emitted
by B in the coding optimal for A will be $-q \ln p - (1 - q) \ln(1 - p)$ while the entropy per character of the sequence
emitted by B in its optimal coding is $-q \ln q - (1 - q) \ln(1 - q)$. The number of bits per character wasted to encode
the sequence emitted by B with the coding optimal for A is the relative entropy of A and B,

$$D(B\|A) = D(q\|p) = -q \ln\frac{p}{q} - (1 - q) \ln\frac{1 - p}{1 - q}. \qquad (11)$$

A linguistic example will help to clarify the situation: transmitting an Italian text with a Morse code optimized for
English will result in the need of transmitting an extra number of bits with respect to another coding optimized for
Italian: the difference is a measure of the relative entropy.
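
The two-source example can be checked numerically: the relative entropy is the difference between the cost of coding B with the code optimal for A and the entropy of B. The sketch below works in natural units (nats) and uses hypothetical function names; the parameter values are arbitrary.

```python
import math


def entropy(q):
    """Entropy per character (nats) of a source emitting 0 with probability q."""
    return -(q * math.log(q) + (1 - q) * math.log(1 - q))


def cross_cost(q, p):
    """Cost per character (nats) of coding a q-source with the code optimal for p."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))


def relative_entropy(q, p):
    """D(q||p): cost wasted by using the code optimal for p on a q-source."""
    return cross_cost(q, p) - entropy(q)


if __name__ == "__main__":
    q, p = 0.3, 0.6  # illustrative values, not taken from the paper
    print(relative_entropy(q, p), relative_entropy(p, q))
```

Printing both orderings shows that the two values differ, which is the lack of symmetry mentioned in the Introduction.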
The general definition of the relative entropy for two generic distributions $p_N(x)$ and $q_N(x)$, i.e. the probability
distributions for sequences $x$ of $N$ characters emitted by two different sources, is given by

$$D_N(q_N\|p_N) = \sum_x q_N(x) \ln\frac{q_N(x)}{p_N(x)}. \qquad (12)$$

Let us stress that in general $D_N(q_N\|p_N)$ could be infinite simply because of sequences emitted by the first source
and not existing in the second (i.e. $p_N(x) = 0$ for some sequences $x$).



B. Merhav-Ziv algorithm for the computation of the relative entropy 



It is interesting to recall the algorithm proposed by Merhav and Ziv [8] for the measurement of the relative entropy.
The method is based on a procedure very similar to the one used in LZ77. This procedure is called parsing and it
consists in a sequential search for phrases, each of which is the shortest string which is not a previously parsed phrase (notice
that this is a self-parsing procedure since the algorithm only considers one sequence). In LZ77 this would be the longest
string for which there is a match in the window already analyzed. The Merhav-Ziv algorithm modifies this self-parsing
procedure by introducing a cross-parsing, i.e. a sequential parsing of a sequence $z$ with respect to another sequence
$x$. In practice one has to find the largest integer $m$ such that the sub-sequence $(z_1, \ldots, z_m) = (x_i, x_{i+1}, \ldots, x_{i+m-1})$ for
some $i$. The string $(z_1, \ldots, z_m)$ is defined as the first phrase of $z$ with respect to $x$. Next one starts from $z_{m+1}$ and
finds, in a similar manner, the longest sub-sequence $(z_{m+1}, \ldots, z_n)$ which appears in $x$, and so on. The procedure ends
once the entire sequence $z$ has been parsed with respect to $x$. If $c(z|x)$ is the number of phrases in $z$ with respect to
$x$, Merhav and Ziv demonstrate that for two sequences of length $n$ the quantity

$$\Delta(z\|x) = \frac{1}{n}\left[c(z|x) \ln n - c(z) \ln c(z)\right] \qquad (13)$$

converges as $n \to \infty$ to the relative entropy between the two sources that emitted the two sequences $z$ and $x$. Notice
that $c(z) \ln c(z)$ is the measure of the complexity of the sequence $z$ obtained by a self-parsing procedure like the one
defined by LZ78.
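
The cross-parsing and self-parsing procedures can be sketched as follows. This is an illustrative, unoptimized reading of the description above, not the authors' code: sequences are Python strings, the substring search is brute force, a symbol of $z$ that never appears in $x$ is counted as a one-character phrase (an assumption of this sketch), and all function names are hypothetical.

```python
import math
import random


def cross_parse_count(z, x):
    """c(z|x): number of phrases when z is cut into longest substrings found in x."""
    count, pos = 0, 0
    while pos < len(z):
        length = 1
        # Grow the candidate phrase while it still occurs somewhere in x.
        while pos + length <= len(z) and z[pos:pos + length] in x:
            length += 1
        # The longest match has length - 1 characters; advance by at least one
        # so that a symbol absent from x still counts as a phrase.
        pos += max(length - 1, 1)
        count += 1
    return count


def self_parse_count(z):
    """c(z): number of phrases in an LZ78-style incremental self-parsing of z."""
    seen, count, pos = set(), 0, 0
    while pos < len(z):
        length = 1
        # Shortest string starting at pos that is not a previously parsed phrase.
        while pos + length <= len(z) and z[pos:pos + length] in seen:
            length += 1
        seen.add(z[pos:pos + length])
        pos += length
        count += 1
    return count


def merhav_ziv_estimate(z, x):
    """[c(z|x) ln n - c(z) ln c(z)] / n for two strings of equal length n."""
    n = len(z)
    c_zx, c_z = cross_parse_count(z, x), self_parse_count(z)
    return (c_zx * math.log(n) - c_z * math.log(c_z)) / n


def bernoulli_string(n, p_zero, rng):
    """String of '0' and '1' with '0' drawn with probability p_zero."""
    return "".join("0" if rng.random() < p_zero else "1" for _ in range(n))


if __name__ == "__main__":
    z = bernoulli_string(2000, 0.5, random.Random(1))
    x_same = bernoulli_string(2000, 0.5, random.Random(2))
    x_diff = bernoulli_string(2000, 0.9, random.Random(3))
    print(merhav_ziv_estimate(z, x_same), merhav_ziv_estimate(z, x_diff))
```

At these short lengths the absolute values are only rough, but the estimate obtained against a sequence from a different source should come out clearly larger than the one obtained against a sequence from the same source.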

It is important to remark that the convergence of the Merhav-Ziv method to the relative entropy is extraordinarily
fast if compared, for instance, with that of the LZ77 algorithm, since here the convergence is, as Merhav and Ziv
write [8], almost exponential. This fact has the important consequence that one can estimate the complexity of a
given sequence more effectively by computing its relative entropy with respect to itself rather than in an absolute way.



IV. RELATIVE ENTROPY AND LEARNING 



In this section we describe how the LZ77 algorithm zips a file obtained by appending a file B of length $L_B$ to a
file A of length $L_A$. The files A and B are emitted by two ergodic sources with ergodic measures given by $p_A$ and $p_B$
respectively. We will use the symbols A and B to denote both the files and their sources.

In particular it is important to understand how the second file is encoded once the sequential zipper starts reading
it. Very roughly what happens is the following. First of all the zipper encodes file A. Then it begins encoding file B.
Initially the zipper will find the longest match of the file B in the file A. After a while, however, the longer the fraction
of B already analyzed, the larger the probability of finding the longest match in file B itself. Asymptotically the
longest matches of file B will be found only inside B. This means that we can roughly describe this process as a
two-step process: at first the zipper tends to optimize the coding for the A part, while later it encodes
the B file with the coding obtained for the A part (transient) as well as with the statistics proper to the B file (which
will asymptotically dominate). For these reasons the zipping procedure of A + B can be seen as a sort of learning
process.

It is convenient here to consider the following idealized problem. Let $\sigma$ be an infinite sequence extracted with
measure $p_B$. Let $\sigma_A$ be a sequence of length $L_A$ extracted with the measure $p_A$, and $\sigma_B$ a sequence of length $L_B$
extracted from the measure $p_B$. Let us define the function $P(L_A, L_B)$ as the probability that the longest subsequence
of $\sigma_A$ matching a subsequence of $\sigma$ is longer than the longest subsequence of $\sigma_B$ matching a subsequence of $\sigma$. In
other words $P(L_A, L_B)$ is the probability that $\sigma$ finds a longer part of itself in $\sigma_A$ rather than in $\sigma_B$. Moreover let us
define $\mathcal{P}(\ln L_A, \ln L_B) \equiv P(L_A, L_B)$.
Having fixed the notation we can go back to the original problem. In this case $P(L_A, L_B)$ will be the probability
that, once the zipper starts scanning the B part of the A + B file, it finds a match in the A part rather than in
the B part.
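
A rough Monte Carlo estimate of this probability can be obtained by brute force for two Bernoulli sources. The sketch below is only illustrative and rests on several assumptions introduced here: a match is attributed to A when it starts inside the A part, ties are assigned to A, matches may overlap the current position, and a short `lookahead` tail drawn from the B source stands in for the continuation of the sequence. All names and parameter values are hypothetical.

```python
import random


def longest_match_origin(ab, i, l_a):
    """Return ('A' or 'B', length) for the longest match of ab[i:] starting before i.

    The match is attributed to A when it starts in ab[:l_a]; ties go to A
    (an arbitrary choice of this sketch).
    """
    best_start, best_len = 0, 0
    for start in range(i):
        length = 0
        while i + length < len(ab) and ab[start + length] == ab[i + length]:
            length += 1
        if length > best_len:
            best_start, best_len = start, length
    return ("A" if best_start < l_a else "B"), best_len


def bernoulli_string(n, p_zero, rng):
    """String of '0' and '1' with '0' drawn with probability p_zero."""
    return "".join("0" if rng.random() < p_zero else "1" for _ in range(n))


def fraction_found_in_a(p_a, p_b, l_a, l_b, trials=40, lookahead=64, seed=0):
    """Fraction of trials in which the longest match lies in the A part."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = bernoulli_string(l_a, p_a, rng)
        # B already scanned (l_b symbols) followed by the symbols still to encode.
        b = bernoulli_string(l_b + lookahead, p_b, rng)
        origin, _length = longest_match_origin(a + b, l_a + l_b, l_a)
        hits += origin == "A"
    return hits / trials


if __name__ == "__main__":
    print(fraction_found_in_a(0.9, 0.5, l_a=2000, l_b=5))     # B barely scanned
    print(fraction_found_in_a(0.9, 0.5, l_a=200, l_b=2000))   # B largely scanned
```

The first call corresponds to the regime in which almost nothing of B has been read, so matches should mostly come from A; the second to the regime in which the zipper has already learned B.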

We can say that the typical distance between two occurrences of the same substring is inversely proportional to the
probability of the substring itself. Denoting with $d(B\|A)$ the relative entropy per symbol, an argument based on
the Shannon-McMillan theorem |^ shows that the probability of occurrence of a string of length $n$ of the sequence
$\sigma$ with respect to the measure $p_A$ is asymptotically given by $e^{-n[h_B + d(B\|A)]}$. We are assuming that $d(B\|A)$ is a
well defined finite limit of (12) for large $N$ (we shall see in the following that this is not a serious limitation for the
experiment):

$$d(B\|A) = \lim_{N \to \infty} \frac{1}{N} D_N(q_N\|p_N). \qquad (14)$$

Therefore the length $n_A$ of the longest match found in A will be approximately obtained by imposing
$L_A e^{-n_A[h_B + d(B\|A)]} \simeq 1$, whose inversion gives

$$n_A = \frac{\ln L_A}{h_B + d(B\|A)}. \qquad (15)$$

Analogously the length of the longest match found in the part of file B already encoded will be given approximately
by

$$n_B = \frac{\ln L_B}{h_B}. \qquad (16)$$

Therefore we expect that if

$$\frac{\ln L_B}{h_B} \ll \frac{\ln L_A}{h_B + d(B\|A)} \qquad (17)$$

the longest match will be found in A, i.e. $P(L_A, L_B) \simeq 1$, while if

$$\frac{\ln L_B}{h_B} \gg \frac{\ln L_A}{h_B + d(B\|A)} \qquad (18)$$

one expects to find it in B, i.e. $P(L_A, L_B) \simeq 0$.
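
The two regimes are separated by the length of B at which the two estimated match lengths coincide, i.e. $\ln L_B / h_B = \ln L_A / (h_B + d(B\|A))$. For Bernoulli sources this crossover can be evaluated explicitly; the sketch below uses hypothetical helper names and arbitrary parameter values, with entropies in natural units.

```python
import math


def bernoulli_entropy(q):
    """Entropy per symbol (nats) of a source emitting 0 with probability q."""
    return -(q * math.log(q) + (1 - q) * math.log(1 - q))


def bernoulli_relative_entropy(q, p):
    """d(B||A) per symbol for Bernoulli sources B (parameter q) and A (parameter p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))


def expected_match_lengths(l_a, l_b, q, p):
    """Estimated longest match lengths in A and in the scanned part of B."""
    h_b = bernoulli_entropy(q)
    d = bernoulli_relative_entropy(q, p)
    return math.log(l_a) / (h_b + d), math.log(l_b) / h_b


def crossover_length(l_a, q, p):
    """Value of L_B at which the two estimates coincide: L_B = L_A ** alpha."""
    h_b = bernoulli_entropy(q)
    alpha = h_b / (h_b + bernoulli_relative_entropy(q, p))
    return l_a ** alpha


if __name__ == "__main__":
    # Illustrative values: A emits 0 with probability 0.9, B is a fair coin.
    print(crossover_length(10_000, q=0.5, p=0.9))
```

When the two sources coincide the relative entropy vanishes and the crossover length equals $L_A$; the larger the relative entropy, the shorter the portion of B that has to be read before matches are found in B itself.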

It is important now to focus more precisely on the transient region where, as already noticed, a sort of
learning process takes place. In order to do this we first consider the case in which the two sequences A and B are (0, 1) Bernoulli
sequences of length $L_A$ and $L_B$ respectively. Afterward we shall try to generalize the result.

The first source emits 0 with probability $p$ and 1 with probability $1 - p$. The second source emits 0 with
probability $q$ and 1 with probability $1 - q$. Therefore, in a typical sequence of length $n$ emitted by the second source,
0 will appear approximately $qn$ times while 1 will appear approximately $(1 - q)n$ times. More precisely we can say
that $m_0$ (the number of zeros in the second sequence) is approximately a Gaussian random variable with average $qn$
and variance $O(n)$.

By neglecting the fluctuations of $m_0$ one has that the probability of this sequence with respect to the measure of
the first source will be approximately given by

$$p^{qn} (1 - p)^{(1 - q)n}. \qquad (19)$$

This expression is nothing but $e^{-n[h_B + d(B\|A)]}$, confirming, at least in this particular case, the previous result.
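
The identification can be verified in one line of algebra. Assuming, as in the text, that a typical sequence of the second source contains about $qn$ zeros, that $h_B = -q \ln q - (1 - q) \ln(1 - q)$ and that $d(B\|A)$ is the Bernoulli relative entropy of Eq. (11), its probability under the measure of the first source can be rewritten as:

```latex
\begin{align*}
p^{qn} (1-p)^{(1-q)n}
  &= \exp\bigl\{ n \left[ q \ln p + (1-q) \ln(1-p) \right] \bigr\} \\
  &= \exp\Bigl\{ -n \left[ -q \ln q - (1-q) \ln(1-q) \right]
       - n \Bigl[ q \ln\frac{q}{p} + (1-q) \ln\frac{1-q}{1-p} \Bigr] \Bigr\} \\
  &= e^{-n \left[ h_B + d(B\|A) \right]} .
\end{align*}
```

The terms in $q \ln q$ and $(1-q)\ln(1-q)$ cancel between the two brackets of the second line, leaving exactly the exponent of the first line.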

Now let us take into account the fluctuations. $m_0$ has random fluctuations of order $\sqrt{n}$ around its average. These
fluctuations induce fluctuations of the probability of this string with respect to the measure $p_B$. We then expect
fluctuations of order $\sqrt{n} = O(\sqrt{\ln L})$ of the length $n_A$ of the longest match found in the first string. The same is true
for $n_B$. This leads us to conjecture that, as $L_A, L_B \to \infty$, $\mathcal{P}(x, y)$ converges to a unique function under this scaling,
i.e.

$$\mathcal{P}(x, y) \xrightarrow[x, y \to \infty]{} f\left(\frac{y - \alpha x}{\sqrt{x}}\right), \qquad (20)$$

where $\alpha = \frac{h_B}{h_B + d(B\|A)}$. On the basis of large deviations theory we expect this conjecture to be valid for sequences
with short term memory, i.e. where the correlations decay sufficiently fast. In the next section we shall numerically
check this conjecture.

The rigorous analysis of the fluctuations of the length of the longest match is a very interesting and difficult problem 
(see [0 and @). 

V. NUMERICAL RESULTS 

The scaling hypothesis (20), introduced in the previous section for the so-called learning function $P(x, y)$, can be tested on finite-size symbol sequences generated according to some stochastic rule, e.g. with pseudo-random number generators or with some kind of non-trivial dynamical system. In this section we shall check this hypothesis in three cases of increasing complexity: simple Bernoulli schemes, Markov processes and non-Markovian processes (using in particular the symbolic dynamics generated by the Lozi map).

A. Bernoulli scheme 

The simplest random sequence of symbols is generated by a Bernoulli scheme: at each time $t$ the symbol $S(t)$ is 0 with probability $p$ and 1 with probability $1 - p$, with $p \in [0, 1]$. This is the sequence of biased (unfair) coin tosses. In this case it is very easy to see that $h = h_N = H_1 = -[p \ln p + (1 - p)\ln(1 - p)]$ for every $N \geq 1$, and the effective measure complexity is $C = 0$.

We have generated a sequence $A$ of 0's and 1's of length $L_A$ using a Bernoulli scheme with a probability $p_A$ for 0's, and then a set of 5000 sequences $B$ of length $L_B^{max}$ where 0's occur with probability $p_B$. For these cases the relative entropy (for sequences of any length) is given by

$$d(p_B\|p_A) = p_B \ln\frac{p_B}{p_A} + (1 - p_B)\ln\frac{1 - p_B}{1 - p_A}. \tag{21}$$

For each sequence of this set, the following numerical experiment has been performed:

1. A sequence $AB$ (of length $L_A + L_B$) is obtained by appending the sequence $B$ to the end of the sequence $A$;

2. One starts scanning the sequence $AB$ from the point $i = i_{start} = L_A + 1$, i.e. from the first character of the sequence $B$;

3. One looks for the longest sub-sequence that:

(a) starts at $i$ and,

(b) is identical to a sub-sequence contained in the part $[1, i]$ of the joint sequence $AB$.
The length of this maximum sub-sequence is called $n_{max}$.

4. The index $i$ is increased by $n_{max}$. If $i < L_A + L_B^{max}$ the algorithm goes to step 3, otherwise the algorithm stops.
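The scanning procedure above, together with the bookkeeping of where each match is found, can be sketched as follows. This is a minimal, quadratic-time illustration and not the code used for the results reported here. Several choices are assumptions: a match is attributed to $A$ when it starts inside $A$, matches are allowed to overlap position $i$, ties are resolved in favour of the earliest starting point, and the index advances by at least one symbol when no match exists. All function names are illustrative.

```python
import random


def longest_match(seq, i):
    """Length and start of the longest sub-sequence beginning at i that also
    begins at an earlier position j < i (overlap with position i is allowed)."""
    best_len, best_start = 0, -1
    n = len(seq)
    for j in range(i):
        k = 0
        while i + k < n and seq[j + k] == seq[i + k]:
            k += 1
        if k > best_len:  # strict inequality: ties keep the earliest start
            best_len, best_start = k, j
    return best_len, best_start


def parse_b(a, b):
    """Scan B inside the joint sequence AB. Returns one (offset, found_in_a)
    pair per parsed phrase, where offset is the position inside B."""
    seq = a + b
    i = len(a)  # 0-based index of the first character of B
    events = []
    while i < len(seq):
        n_max, start = longest_match(seq, i)
        events.append((i - len(a), n_max > 0 and start < len(a)))
        i += max(n_max, 1)  # assumption: move on even if nothing matched
    return events


def bernoulli(length, p_zero, rng):
    """Bernoulli sequence where 0 occurs with probability p_zero."""
    return [0 if rng.random() < p_zero else 1 for _ in range(length)]


def learning_curve(l_a, l_b_max, p_a, p_b, realizations, seed=0):
    """Fraction of matches found in A as a function of the position in B,
    averaged over several realizations of B with A kept fixed."""
    rng = random.Random(seed)
    a = bernoulli(l_a, p_a, rng)
    hits = [0] * l_b_max
    counts = [0] * l_b_max
    for _ in range(realizations):
        b = bernoulli(l_b_max, p_b, rng)
        for offset, in_a in parse_b(a, b):
            counts[offset] += 1
            hits[offset] += in_a
    return [(lb, hits[lb] / counts[lb]) for lb in range(l_b_max) if counts[lb]]


if __name__ == "__main__":
    for lb, p in learning_curve(500, 200, 0.3, 0.7, realizations=20)[:10]:
        print(lb, p)
```

The sizes used in the usage example are deliberately small; sequences as long as those discussed in the text would need a faster matching strategy, such as a suffix-based index.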

In the above procedure, one keeps track of the statistics of the sub-sequence matchings: in particular we are interested in the number of sub-sequences found in $A$ or in $B$ as a function of $L_B = i - i_{start}$. At the beginning of the scanning procedure most of the matchings are found in $A$. When $L_B$ is large enough, the sub-sequence matchings found in $B$ can compete in length with the ones found in $A$. Averaging over many "realizations" of the sequence $B$ yields smooth statistics, i.e. a smooth curve $P(L_A, L_B)$ vs. $L_B$ with fixed $L_A$.

Fig. 1 reports the curves obtained with the above procedure for $P(L_A, L_B)$ vs. $L_B$ for different values of $L_A$ and different choices of the pair $(p_A, p_B)$, as well as their collapse using the scaling function (20). The collapse is indeed very satisfying, providing the first evidence for the conjecture in (20). The picture also shows the failure of the scaling form when $\alpha$ is too small (pluses and crosses in the inset, not reported in the main plot). This happens when the two sequences are too different or when the second sequence has a very low entropy $h$: in both cases parsing the sub-sequences of $B$ with sub-sequences of its own past (and not from $A$) becomes convenient too early, as can be seen in the inset of the figure. As a consequence, the length of $A$ does not matter for the parsing of sequence $B$, and the two curves obtained with different $L_A$ (those with $\alpha = 0.156$) are identical. For those curves, the scaling obviously fails.



FIG. 1: Collapse of $P(L_A, L_B)$ versus the rescaled coordinate (discussed in the text), for different pairs of Bernoulli processes with different probabilities of symbol "zero" $(p_A, p_B)$ and with different lengths $L_A$ of buffer $A$. In the inset the same data are shown versus $L_B$, i.e. without any rescaling. The values of $\alpha$ are the following: 0.643 for $(p_A, p_B) = (0.3, 0.7)$, 0.768 for $(p_A, p_B) = (0.4, 0.7)$, 0.892 for $(p_A, p_B) = (0.4, 0.6)$, 0.156 for $(p_A, p_B) = (0.1, 0.9)$.
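The values of $\alpha$ quoted in the caption of Fig. 1 can be cross-checked from the Bernoulli entropy and the relative entropy (21). The sketch below assumes $\alpha = h_B/(h_B + d(B\|A))$ with natural logarithms, an assumption that reproduces the quoted values to within $10^{-3}$; the function names are illustrative.

```python
import math


def bernoulli_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli scheme with P(0) = p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def bernoulli_relative_entropy(p_b, p_a):
    """Relative entropy d(p_B || p_A) of Eq. (21), in nats."""
    return (p_b * math.log(p_b / p_a)
            + (1 - p_b) * math.log((1 - p_b) / (1 - p_a)))


def alpha(p_a, p_b):
    """Assumed definition: alpha = h_B / (h_B + d(B||A))."""
    h_b = bernoulli_entropy(p_b)
    return h_b / (h_b + bernoulli_relative_entropy(p_b, p_a))


if __name__ == "__main__":
    for p_a, p_b in [(0.3, 0.7), (0.4, 0.7), (0.4, 0.6), (0.1, 0.9)]:
        print((p_a, p_b), alpha(p_a, p_b))
```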

B. Markovian sequences 

The logical step after having tested the scaling conjecture (20) on Bernoulli schemes is a test using sequences generated by means of Markov chains. A Markov chain is a random process with a discrete space of states, where the probability of every state is determined by one or more previous states. A Markov chain is therefore a Bernoulli process generalized to any number of symbols and provided with memory. The order of a Markov chain is the number of previous states influencing the present: e.g. for a Markov chain of order $k = 1$ the probability of having a certain symbol depends only on the previous symbol and is determined by its conditional probability $W_{ij} = P(S(t) = j \,|\, S(t - 1) = i)$. We have tested the scaling hypothesis on the Lempel-Ziv parsing procedure of pairs of two-symbol, first-order, symmetric Markov chains. This means that both $A$ and $B$ are sequences of 0's and 1's and that their transition matrix is of the form:

$$W = \begin{pmatrix} w & 1 - w \\ 1 - w & w \end{pmatrix} \tag{22}$$

with $w \in [0, 1]$ the probability of repeating the previous symbol. The sequences $A$ and $B$ have different transition matrices, that is $w = w_A$ for $A$ and $w = w_B$ for $B$. In practice a sequence obtained with $w$ near 1 is something like 11111100000011111100000..., while a sequence obtained with $w$ near 0 is like 010101001010101101010....
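A sequence of this kind can be generated with a few lines of code. The following is a minimal sketch of a two-symbol, first-order, symmetric Markov chain in which $w$ is the probability of repeating the previous symbol; the function name, the uniformly random first symbol and the `seed` argument are illustrative assumptions.

```python
import random


def symmetric_markov(length, w, seed=None):
    """Binary sequence in which each symbol repeats the previous one with
    probability w and flips it with probability 1 - w."""
    rng = random.Random(seed)
    seq = []
    for t in range(length):
        if t == 0:
            seq.append(rng.randrange(2))  # assumption: uniform first symbol
        elif rng.random() < w:
            seq.append(seq[-1])           # repeat the previous symbol
        else:
            seq.append(1 - seq[-1])       # flip the previous symbol
    return seq


if __name__ == "__main__":
    print("".join(map(str, symmetric_markov(30, 0.9, seed=1))))  # long runs
    print("".join(map(str, symmetric_markov(30, 0.1, seed=1))))  # near alternation
```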

For a long sequence of this kind the probability $P(S)$ is stationary and $H_N = H_1 + (N - 1)h$, that is $C = H_1 - h$. Moreover we are interested in the quantity (in the following referred to as block cross entropy) $\tilde{h}_N = \tilde{H}_{N+1} - \tilde{H}_N$ vs. $N$, where

$$\tilde{H}_N = -\sum_{\{C_N\}^*} P_B(C_N) \ln P_A(C_N) \simeq H_N + D_N(B\|A) \tag{23}$$

with $P_A(C_N)$ the frequency of the $N$-sequence $C_N$ in the sequence $A$ (and analogously for $P_B(C_N)$), and $\{C_N\}^*$ the set of $N$-sequences contained both in $A$ and in $B$. If we consider infinite sequences $A$ and $B$ and a two-state Markov process (as the one introduced in this section) then $\{C_N\}^* = \{C_N\}$, i.e. the whole set of $2^N$ sequences of length $N$ is explored by both dynamics. For the kind of Markov chain described by the transition matrix in (22), we can therefore calculate
$$\tilde{H}_N = \tilde{H}_1 - (N - 1) \sum_{\{S_i, S_j\}} P_B(S_i)\, W^B_{ij} \ln W^A_{ij} \tag{24a}$$

$$\tilde{h}_N = -\sum_{\{S_i, S_j\}} P_B(S_i)\, W^B_{ij} \ln W^A_{ij} = h_B + d(B\|A). \tag{24b}$$

More generally, the above formulas hold if $W^A_{ij}$ is positive whenever $W^B_{ij}$ is positive.
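For the symmetric two-state chain the sum in (24b) can be evaluated in closed form, since the stationary probability of each symbol is 1/2. The sketch below is a worked instance under that assumption; it gives $h_B + d(B\|A) = -[w_B \ln w_A + (1 - w_B)\ln(1 - w_A)]$ and, from it, the relative entropy and the ratio $\alpha = h_B/(h_B + d(B\|A))$. The function names are illustrative.

```python
import math


def markov_entropy(w):
    """Entropy rate (nats) of the symmetric two-state chain with parameter w."""
    return -(w * math.log(w) + (1 - w) * math.log(1 - w))


def markov_cross_entropy(w_a, w_b):
    """Eq. (24b) for symmetric chains: -sum_ij P_B(i) W^B_ij ln W^A_ij with
    P_B(i) = 1/2, which reduces to the expression below."""
    return -(w_b * math.log(w_a) + (1 - w_b) * math.log(1 - w_a))


def markov_relative_entropy(w_a, w_b):
    """d(B||A) obtained as cross entropy minus the entropy of B."""
    return markov_cross_entropy(w_a, w_b) - markov_entropy(w_b)


def markov_alpha(w_a, w_b):
    """Assumed definition: alpha = h_B / (h_B + d(B||A))."""
    return markov_entropy(w_b) / markov_cross_entropy(w_a, w_b)


if __name__ == "__main__":
    print(markov_relative_entropy(0.7, 0.6), markov_alpha(0.7, 0.6))
```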

In Fig. 2 we show the effects of the finiteness of the sequences $A$ and $B$ on the block entropy $H_N$ and on the block cross entropy $\tilde{h}_N$: for finite sequences $A$ and $B$, even in the case of two-state Markov chains, the sets of words of length $N$ may not coincide. $A$ and $B$ are sequences of length 20000 generated with the symmetric one-step Markov processes with different transition matrices $W$, i.e. with different parameters $w_A$ and $w_B$.




FIG. 2: Left: $H_N - H_{N-1}$ vs. $N$ for a Markov process of order 1 with symmetrical transition matrix (see Eq. (22)), calculated numerically using a sequence of 20000 symbols, for different values of the parameter $w$. The plateau (reached at $N = 2$) corresponds to the theoretical $h$, while the successive decay of the curves is due to poor statistics. Right: cross entropy $\tilde{h}_N = \tilde{H}_N - \tilde{H}_{N-1}$ for different pairs $(A, B)$ of such Markov processes, characterized by parameters $w_A, w_B$. The plateaus (highlighted by dashed lines) correspond to the theoretical value of the cross entropy.

It can be seen that the plateau representing $h$ is reached at $N = 2$, as expected for Markov chains of order $k = 1$. Moreover, the effect of finite size can be seen: the sequences considered are 20000 symbols long; therefore, invoking the Shannon-McMillan theorem, $N$ must not be too large in order to satisfy the condition that the number of typical $N$-sequences be much smaller than the length of the sequence, i.e. $\mathcal{N} = \exp(hN) \ll 20000$. Otherwise the statistics becomes too poor and $H_N - H_{N-1}$ rapidly departs from $h$. In the right plot of Fig. 2 we show the behaviour of $\tilde{h}_N$: the first plateau of the curves in this graph provides an estimate of the cross entropy $d(B\|A) + h_B$, where $d(B\|A)$ is the Kullback-Leibler entropy relative to the pair of sequences $A$ and $B$, while $h_B$ is the Shannon entropy of sequence $B$. This figure shows how finite size effects appear in the computation of $d(B\|A)$ well before those appearing in the computation of $h$; this is a direct consequence of the operative definition used in this computation: in order to have a good estimate of the cross entropy, a large number of $N$-sequences common to both $A$ and $B$ is indeed needed, reducing the value of the finite size cut-off. The scaling of $P(L_A, L_B)$ for pairs of Markov sequences is shown in Fig. 3. Again a good collapse is obtained using the previously proposed scaling form (20). It is also clear that the collapse fails for pairs of processes with $\alpha \ll 1$, i.e. the pairs with the strongest difference in the transition matrix.
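The operative definitions of the block entropy and of the block cross entropy used in this section can be illustrated as follows. This is a minimal sketch based on word frequencies counted with a sliding window of length $N$; the cross entropy sum is restricted to the words that occur in both sequences, which is the origin of the stronger finite size effects. The function names are illustrative and natural logarithms are used.

```python
from collections import Counter
import math


def word_frequencies(seq, n):
    """Relative frequencies of the n-words found with a sliding window."""
    words = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(words.values())
    return {word: count / total for word, count in words.items()}


def block_entropy(seq, n):
    """H_N = -sum P(C_N) ln P(C_N) estimated from a finite sequence."""
    return -sum(p * math.log(p) for p in word_frequencies(seq, n).values())


def block_cross_entropy(seq_a, seq_b, n):
    """-sum P_B(C_N) ln P_A(C_N), restricted to the n-words present in both
    sequences (words of B never seen in A are skipped)."""
    p_a = word_frequencies(seq_a, n)
    p_b = word_frequencies(seq_b, n)
    return -sum(p_b[w] * math.log(p_a[w]) for w in p_b if w in p_a)


if __name__ == "__main__":
    periodic = [0, 1] * 100
    # For a periodic sequence the entropy increment H_2 - H_1 is nearly zero.
    print(block_entropy(periodic, 2) - block_entropy(periodic, 1))
```

The differences $H_N - H_{N-1}$ and the analogous cross entropy increments can then be plotted against $N$ to look for the plateau.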

C. Non-Markovian sequences: Lozi map symbolic dynamics 

It is interesting to probe a class of signals (i.e. sequences) with a higher degree of complexity than simple Markov chains. Dynamical systems are a rather natural source of non-trivial signals. A symbolic sequence can be associated to the dynamical system by means of a partition of the phase space $\Omega$, i.e. $\{\omega_i\}$ with $m$ elements such that $\bigcup_{i=1}^{m} \omega_i = \Omega$ and $\omega_i \cap \omega_j = \emptyset$ for every $i \neq j$ in $[1, m]$. Every trajectory $x(t)$ is therefore mapped into a sequence of symbols whose



FIG. 3: Collapse of $P(L_A, L_B)$ versus the rescaled coordinate for Markov processes of order 1 and symmetric transition matrix (see Eq. (22)) with different values of the pairs $(w_A, w_B)$, with $L_A = 20000$. The figure also indicates the values of $\alpha$, see (20). In the inset the data without rescaling.

values are the m symbols of the alphabet. An interesting non-trivial example is the symbolic dynamics obtained with 
a binary partition of the x variable of the Lozi map, defined as: 

$$x(n + 1) = -a|x(n)| + y(n) + 1 \tag{25a}$$

$$y(n + 1) = b\,x(n) \tag{25b}$$

where $a$ and $b$ are parameters. The sequence of symbols used in the following test is obtained taking 0 when $x < 0$ and 1 when $x > 0$. For $b = 0.5$, numerical studies show that the Lozi map is chaotic for $a$ in the interval (1.51, 1.7).
For a discussion of the Lozi map, computation of Lyapunov exponents and representation of its symbolic dynamics 
in terms of Markov chains, see p5[ . 
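The symbolic dynamics just described can be produced by iterating (25a) and (25b) and recording the sign of $x$. The following is a minimal sketch; the function name, the default initial condition and the optional `transient` argument (number of initial iterations to discard) are illustrative assumptions, and the case $x = 0$ is arbitrarily mapped to the symbol 1.

```python
def lozi_symbols(a, b=0.5, n=10000, x0=0.1, y0=0.1, transient=0):
    """Iterate the Lozi map and return n symbols: 0 when x < 0, 1 otherwise."""
    x, y = x0, y0
    symbols = []
    for step in range(transient + n):
        # Simultaneous update: y(n + 1) uses the old value x(n).
        x, y = -a * abs(x) + y + 1.0, b * x
        if step >= transient:
            symbols.append(0 if x < 0 else 1)
    return symbols


if __name__ == "__main__":
    print("".join(map(str, lozi_symbols(1.6, n=60))))
```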

Fig. 4 reports the numerical computation of $H_N$ and $\tilde{H}_N$ (the relative block cross entropies) for several sequence lengths, always using the same pair of processes $a_A = 1.56$ and $a_B = 1.52$. The aim is to highlight finite size effects as well as to estimate the Shannon and Kullback-Leibler entropies needed for the collapse of $P(L_A, L_B)$. The estimate of $d(B\|A)$ and $h_B$, and therefore of $\alpha$, is obtained with an accuracy of about 10%.

More specifically, Fig. 4 is particularly enlightening from the point of view of the meaning of the effective measure complexity $C$. A naive order 1 Markovian approximation of the map is far from reproducing the dynamical properties of the Lozi map. This can be appreciated in Fig. 4 by noting that $C$ is not small.

Finally, Fig. 5 shows that the collapse of the learning curves $P(x, y)$ is very well verified, using again averages over the $B$ sequence (i.e. different initial conditions) and different lengths for the $A$ sequence.

Let us note that in the symbolic sequence generated by the Lozi map, for a given value of the parameter $a$, some words are forbidden and, in general, one has different forbidden sequences for different values of $a$. Therefore the limit $\lim_{N \to \infty} \frac{1}{N} D_N(B\|A)$ does not exist. Nevertheless, using as $d(B\|A)$ the value extrapolated from moderate values of $N$, one has a nice agreement with the theoretical argument.

VI. AN EXPERIMENT OF RECOGNITION 

The last set of results concerns one of the main motivations of this analysis, i.e. its practical applications. The algorithm proposed in ||l^, ^ has its main justification in its efficiency in the framework of sequence recognition: the algorithm is able to provide an estimate of the Kullback-Leibler entropy of a sequence of unknown provenance relative to a set of sequences whose provenance is certain (known sources) and which are used as reference sequences, giving the most "similar" sequence and therefore the most probable source for the sequence of unknown provenance. In this context, we have checked that this recipe correctly recognizes a symbolic sequence drawn from the class of Lozi maps. Though the results are very preliminary and a systematic analysis is still needed, some interesting conclusions can be drawn.
FIG. 4: Block entropies $H_N$ and block cross entropies $\tilde{H}_N$ versus $N$ for sequences of symbols obtained with a binary partition of the Lozi map. The $H_N$ are calculated using the Lozi map with parameter $a = 1.52$ and $a = 1.56$. The $\tilde{H}_N$ are calculated using pairs of Lozi maps with $a_A = 1.56$ and $a_B = 1.52$. All calculations have been performed with sequences of different length $L$, to probe finite size effects.




FIG. 5: Collapse of $P(L_A, L_B)$ versus the rescaled coordinate for sequences of symbols obtained with a binary partition of the Lozi map with parameter pairs $(a_A, a_B) = (1.56, 1.52)$, using an estimate of $\alpha = 0.78$ obtained using the values $h_B = 0.15$ and $h_B + d(B\|A) = 0.19$ (see Fig. 4). In the inset the same data are shown versus $L_B$, i.e. without rescaling.



Fig. 6 reports the result of this test. A Lozi map with $a = 1.6$, $b = 0.5$ and initial condition $x = 0.1$, $y = 0.1$ has been used to generate the sequence $A$, of length 10000, that will be used as the unknown sequence. As probing sequences we have generated two sets of sequences, $B$ and $B^*$ respectively, obtained with Lozi maps with the parameters $b = 0.5$ and $a_B = a_{B^*}$ varying between 1.52 and 1.7. The sequences $B$ have length 10000, while the sequences $B^*$ have length 5000 or 1000. The quantities plotted in the inset are the lengths of the compressed code (with the LZ77 algorithm, see the discussion in Section II), i.e. $C(X)$ is the length of the code obtained by compressing the sequence $X$. Data relative to the compression of the sequences $B + B^*$ and $A + B^*$ have been obtained by averaging over 100 different choices of initial conditions. The quantity computed and reported in the main graph is an estimate of the Kullback-Leibler entropy $d(B\|A)$, as the difference (per bit) between $C(A + B^*) - C(A)$ and $C(B + B^*) - C(B)$, which are the estimates of $d(B\|A) + h_B$ and $h_B$ respectively. The bottom plot shows very well how this simple recipe leads to a perfect recognition of the correct value $a = 1.6$: the estimate of the Kullback-Leibler entropy has in fact an absolute minimum for that value.
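The recipe based on the differences $C(A + B^*) - C(A)$ and $C(B + B^*) - C(B)$ can be sketched with a general purpose compressor. The example below uses Python's `zlib` module (DEFLATE, i.e. LZ77 matching followed by Huffman coding, with a 32 kB window) as a stand-in; this is an assumption and not necessarily the LZ77 implementation used for the results reported here. The function names and the normalization in bits per symbol of the probe are illustrative choices.

```python
import zlib


def code_length(symbols):
    """Length in bytes of the zlib-compressed sequence (symbols are 0..255)."""
    return len(zlib.compress(bytes(symbols), 9))


def kl_estimate(a, b, probe):
    """Estimate of d(B||A) in bits per symbol of the probe sequence B*:
    [C(A + B*) - C(A)] - [C(B + B*) - C(B)], divided by the probe length."""
    cross = code_length(a + probe) - code_length(a)       # ~ d(B||A) + h_B
    own = code_length(b + probe) - code_length(b)         # ~ h_B
    return 8.0 * (cross - own) / len(probe)


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    a = [int(rng.random() < 0.9) for _ in range(10000)]
    b = [int(rng.random() < 0.5) for _ in range(10000)]
    probe = [int(rng.random() < 0.5) for _ in range(1000)]
    print(kl_estimate(a, b, probe))
```

Because the compressor window is finite, the reference sequences should not be much longer than the window, otherwise the beginning of $A$ or $B$ is no longer available when the probe is encoded.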

In Fig. 6 one can also appreciate the usefulness of the theoretical analysis of Section IV, i.e. the fact that $L_B^*$ is a good estimate of the best length $L_B$ of the probe sequences $B$ to obtain the optimal resolution in the recognition process. In fact in Section IV we conjectured (and subsequently verified with numerical experiments) that when $L_B$ is smaller than the crossover length $L_B^*$, the LZ77 algorithm is encoding the sequence $B$ with the "language" of $A$, and therefore the length of the encoded sequence is effectively a measure of the distance between the two languages. Using the previous value $\alpha = 0.78$ as a rough estimate for every other choice of the map parameter $a$, and given $L_A = 10000$, one obtains for the crossover length $L_B^* \approx 1300$. In the figure, the resolution power of the LZ77 algorithm with $L_B = 1000$ is much higher than that with $L_B = 5000$.
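The quoted crossover length can be recovered with a one-line computation. The sketch below assumes that the crossover is located where $\ln L_B = \alpha \ln L_A$, i.e. $L_B^* = L_A^{\alpha}$; this assumption reproduces the value of about 1300 for $L_A = 10000$ and $\alpha = 0.78$. The function name is illustrative.

```python
def crossover_length(l_a, alpha):
    """Assumed crossover: ln L_B* = alpha ln L_A, i.e. L_B* = L_A ** alpha."""
    return l_a ** alpha


if __name__ == "__main__":
    print(crossover_length(10000, 0.78))
```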
FIG. 6: Estimate, by means of LZ77 compression, of the Kullback-Leibler entropy $D(B\|A)$ (see text) relative to different pairs $(A, B)$ of sequences of symbols: each pair is composed of a fixed sequence $A$ obtained as a binary partition of a Lozi map with parameter $a_A = 1.6$ and a variable sequence $B$ obtained as a binary partition of a Lozi map with variable parameter $a_B$. The sequences have length $L_A = L_B = 10000$. The estimate of the Kullback-Leibler entropy has its minimum in correspondence of the pair $(A, A)$ (i.e. when $B$ comes from a Lozi map with $a_B = a_A$): this indicates that this estimate of $D(B\|A)$ is capable of recognition in the space of Lozi maps. In the inset the lengths of the LZ-compressed sequences are reported, where $B^*$ is always a sequence of the same kind as $B$ (note that $L_B^* \approx 1300$ and therefore $L_{B^*} = 1000$ and $L_{B^*} = 5000$ are below and beyond the crossover threshold respectively).



VII. CONCLUSIONS 



We have studied the properties of standard sequential compression algorithms in the problem of information extraction from sequences of characters. We have in particular analyzed the learning process that these algorithms perform when they are used to compress heterogeneous data, i.e. data coming from different sources.

The typical benchmark for this study is a finite sequence of $L_A + L_B$ symbols obtained by appending a sequence of $L_B$ symbols emitted by a source $B$ to a sequence of $L_A$ symbols emitted by a source $A$. An algorithm like LZ77 0, after having processed the $A$ part of the sequence, starts encoding the $B$ part using the knowledge acquired while zipping the $A$ part; after a transient the compression algorithm starts encoding the $B$ part using the knowledge coming only from the $B$ part already processed (i.e. the zipper starts learning the $B$ part). We have made a scaling hypothesis that characterizes this transient process in terms of the entropy of the source $B$ and the Kullback-Leibler divergence between the two sequences.

We have studied the finite size scaling (i.e. incorporating fluctuations due to the finite size of the sequences under investigation) by means of numerical experiments on three sets of data coming from different sources: the Bernoulli scheme, the Markov chain of first order (with symmetric transition matrix) and the symbolic dynamics obtained with a binary partition of the Lozi map. These three examples feature an increasing complexity: the Bernoulli scheme emits sequences of uncorrelated random symbols; the Markov chain of first order is the simplest way to enforce correlations among symbols in the sequences; finally the Lozi map has a higher effective measure complexity [35]. The scaling hypothesis is very well verified in all the cases investigated, pointing to the generality of the result.

These results have a practical importance in the analysis of a recently proposed scheme that computes the informational remoteness between two sequences pl|]: in fact this scheme employs a variant of the LZ77 algorithm and gives the best estimate of the remoteness (Kullback-Leibler divergence) when the length of the second sequence is chosen of the order of the threshold value of the learning function we have introduced in this work. We have investigated this point quantitatively, showing that the resolution power of the recognition scheme proposed in |pl[| is highly improved when the length of the second sequence is chosen according to the analysis of the transient. Sequences that are too short or too long can give bad estimates of the Kullback-Leibler divergence and therefore a large uncertainty in the recognition of similar sequences.

Another important field of application is that of the segmentation of heterogeneous sequences, i.e. the identification 
of the boundaries between regions featuring very different properties which, depending on the sequences considered, 
can correspond to very different phenomena (catastrophic events in geophysical time series, or boundaries between 
different sections in genetic sequences just to quote a couple of examples). In all these cases one could try to exploit 
the features of data compression techniques at the interface between heterogeneous regions in order to define and 
optimize suitable observables sensitive to sudden changes. 